{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/embeddings/custom_embeddings.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Custom Embeddings\n",
    "LlamaIndex supports embeddings from OpenAI, Azure, and Langchain. But if this isn't enough, you can also implement any embeddings model!\n",
    "\n",
    "The example below uses Instructor Embeddings ([install/setup details here](https://huggingface.co/hkunlp/instructor-large)), and implements a custom embeddings class. Instructor embeddings work by providing text, as well as \"instructions\" on the domain of the text to embed. This is helpful when embedding text from a very specific and specialized topic.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install dependencies\n",
    "# !pip install InstructorEmbedding torch transformers sentence-transformers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\"\n",
    "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Custom Embeddings Implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Any, List\n",
    "from InstructorEmbedding import INSTRUCTOR\n",
    "\n",
    "from llama_index.core.bridge.pydantic import PrivateAttr\n",
    "from llama_index.core.embeddings import BaseEmbedding\n",
    "\n",
    "\n",
    "class InstructorEmbeddings(BaseEmbedding):\n",
    "    _model: INSTRUCTOR = PrivateAttr()\n",
    "    _instruction: str = PrivateAttr()\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        instructor_model_name: str = \"hkunlp/instructor-large\",\n",
    "        instruction: str = \"Represent a document for semantic search:\",\n",
    "        **kwargs: Any,\n",
    "    ) -> None:\n",
    "        super().__init__(**kwargs)\n",
    "        self._model = INSTRUCTOR(instructor_model_name)\n",
    "        self._instruction = instruction\n",
    "\n",
    "    @classmethod\n",
    "    def class_name(cls) -> str:\n",
    "        return \"instructor\"\n",
    "\n",
    "    async def _aget_query_embedding(self, query: str) -> List[float]:\n",
    "        return self._get_query_embedding(query)\n",
    "\n",
    "    async def _aget_text_embedding(self, text: str) -> List[float]:\n",
    "        return self._get_text_embedding(text)\n",
    "\n",
    "    def _get_query_embedding(self, query: str) -> List[float]:\n",
    "        embeddings = self._model.encode([[self._instruction, query]])\n",
    "        return embeddings[0]\n",
    "\n",
    "    def _get_text_embedding(self, text: str) -> List[float]:\n",
    "        embeddings = self._model.encode([[self._instruction, text]])\n",
    "        return embeddings[0]\n",
    "\n",
    "    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:\n",
    "        embeddings = self._model.encode(\n",
    "            [[self._instruction, text] for text in texts]\n",
    "        )\n",
    "        return embeddings"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Usage Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader, VectorStoreIndex\n",
    "from llama_index.core import Settings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "load INSTRUCTOR_Transformer\n",
      "max_seq_length  512\n"
     ]
    }
   ],
   "source": [
    "embed_model = InstructorEmbeddings(embed_batch_size=2)\n",
    "\n",
    "Settings.embed_model = embed_model\n",
    "Settings.chunk_size = 512\n",
    "\n",
    "# if running for the first time, will download model weights first!\n",
    "index = VectorStoreIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The author wrote short stories and also worked on programming, specifically on an IBM 1401 computer in 9th grade. They used an early version of Fortran and had to type programs on punch cards. Later on, they got a microcomputer, a TRS-80, and started programming more extensively, writing simple games and a word processor. They initially planned to study philosophy in college but eventually switched to AI.\n"
     ]
    }
   ],
   "source": [
    "response = index.as_query_engine().query(\"What did the author do growing up?\")\n",
    "print(response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
